In this R Notebook we preprocess spatial and corresponding reference scRNA-seq data of human Idiopathic pulmonary fibrosis (IPF) lung for cell type deconvolution.
Spatial data preprocessing:
1.1 Input original data files
Here we provide a R rds file IPF_spatial_data.rds
containing the Seurat object of the IPF
lung sample we are working on. The Seurat object contains
raw nUMI and physical locations of
spatial spots, and a thumbnail of tissue image. In
total 4,992 spatial spots and 60,651 genes are included in the raw
data.
The raw data and tissue image have also been uploaded to GSE231385.
1.2 Output data files for cell type deconvolution
We filter out the spatial spots NOT covered by the tissue and the genes NOT expressed in any tissue coverd spots, remaining 3,532 spots and 32,078 genes.
x and y coordinates of spatial spots are
from col and row, respectively.Reference scRNA-seq data preprocessing:
2.1 Input original data files
raw nUMI count matrix and cell type annotation of scRNA-seq data can be downloaded from GSE136831, including 312,928 cells and 45,947 genes.
Here we only use cells from one IPF subject 225I,
and provide a R rds file IPF_scRNA_data.rds
containing a Seurat object with 12,070
cells and 60,651 genes. To process the scRNA-seq data of this subject,
we use STARsolo to map
the reads to reference genome (GRCh38), and refined the cell type
annotation. So both the nUMI count matrix and cell type annotation are
different with the published version in GSE136831.
Note that we set all entries of the sparse matrix in
data slot as 0 to decrease the file
size.
2.2 Output data files for cell type deconvolution
We select 26 major cell types to work on, and filter out cells of other cell types and genes NOT expressed in any selected cells, remaining 11,227 cells and 35,483 genes.
Raw nUMI of 11,227 cells with 26 cell types and 35,483 genes: IPF_ref_scRNA_cell_nUMI.csv.gz.
Cell type annotation for those 11,227 cells: IPF_ref_scRNA_cell_celltype.csv.
We also manually select 2,534 cell type specific marker genes from the filtered data. To use it in SDePER cell type deconvolution, we create a 26 * 2,534 matrix with all entries are 0, and set the selected 26 cell types as row names and 2,534 marker genes as column names. This cell type maker gene expression profile is saved in IPF_selected_2534_celltype_markers.csv.
version[['version.string']][1] "R version 4.2.2 Patched (2022-11-10 r83330)"
print(sprintf('Seurat package version: %s', packageVersion('Seurat')))[1] "Seurat package version: 4.1.1"
file_name = file.path(home.dir, 'IPF_spatial_data.rds')
org_data = readRDS(file_name)
print(sprintf('load data from %s', file_name))[1] "load data from /home/hill103/Documents/SharedFolder/ToHost/CVAE-GLRM_Analysis/RealData/IPF/IPF_spatial_data.rds"
print(sprintf('spots: %d; genes: %d', ncol(org_data), nrow(org_data)))[1] "spots: 4992; genes: 60651"
# add tissue indicator into meta data
total_spots = ncol(org_data)
tmp.df = org_data@images[[1]]@coordinates
tmp.df = tmp.df[colnames(org_data), ]
stopifnot(nrow(tmp.df) == ncol(org_data))
org_data[['tissue']] = tmp.df$tissue
# subset spots covered by tissue
org_data = subset(org_data, subset = tissue==1)
print(sprintf('spots covered by tissue: %d (%.2f%%)', ncol(org_data), ncol(org_data)/total_spots*100))[1] "spots covered by tissue: 3532 (70.75%)"
Tissue images highlight spots covered by tissue
Idents(org_data) = 'tissue'
SpatialDimPlot(org_data, crop = F, alpha=0.2, cols=c('1'='blue', '0'='red')) + NoLegend()Keep genes detected in >=1 spot
tmp = Matrix::rowSums(org_data@assays$spatial@counts)
need.genes = names(tmp)[tmp>0]
print(sprintf('finally keep %d genes', length(need.genes)))[1] "finally keep 32078 genes"
# subset genes
org_data = subset(org_data, features=need.genes)Violin plot of genes expressed in spots
tmp = data.frame(spot_sum=Matrix::colSums(org_data@assays$spatial@counts>0))
ggplot(tmp, aes(x='', y=spot_sum)) +
geom_violin(fill='orange', width=1) +
geom_jitter(size=0.1, position=position_jitter(0.1)) +
scale_y_continuous(limits=c(0, 9500), breaks=seq(0, 9500, 500)) +
labs(x='', y='#genes expressed in spot')Save spots and genes after filtering into file IPF_spatial_spot_nUMI.csv.gz. Rows as spatial spots and columns as genes.
data.table::fwrite(as.data.frame(as.matrix(Matrix::t(org_data@assays$spatial@counts))), 'IPF_spatial_spot_nUMI.csv.gz', row.names = T)
print(sprintf('save %d gene nUMIs of %d spatial spots into file %s', ncol(org_data), nrow(org_data), 'IPF_spatial_spot_nUMI.csv.gz'))[1] "save 3532 gene nUMIs of 32078 spatial spots into file IPF_spatial_spot_nUMI.csv.gz"
The x and y coordinates of spatial spots
are from col and row, respectively.
col and row are generated by 10x
Space Ranger. Save it into file IPF_spatial_spot_loc.csv.
local_df = org_data@images$x1@coordinates %>%
select(c('col', 'row'))
colnames(local_df) = c('x', 'y')
local_df[1:5, ]
write.csv(local_df, 'IPF_spatial_spot_loc.csv')
print(sprintf('save Physical Locations of spatial spots into file %s', 'IPF_spatial_spot_loc.csv'))[1] "save Physical Locations of spatial spots into file IPF_spatial_spot_loc.csv"
We define the neighborhood of a spatial spot contains the adjacent left, right, top and bottom spot, plus the second closest spots at left and right, that is, one spot has at most 6 neighbors.
The generated Adjacency Matrix A only contains
1 and 0, where 1 represents
corresponding two spots are adjacent spots according to the definition
of neighborhood, while value 0 for non-adjacent spots. Note all
diagonal entries are 0s.
Adjacency Matrix are saved into file IPF_spatial_spot_adjacency_matrix.csv.
getNeighbour = function(array_row, array_col) {
# based on the (row, col) of one spot, return the (row, col) of all 6 neighbours
return(list(c(array_row-1, array_col-1),
c(array_row-1, array_col+1),
c(array_row, array_col-2),
c(array_row, array_col+2),
c(array_row+1, array_col-1),
c(array_row+1, array_col+1)))
}
# adjacency matrix
A = matrix(0, nrow = nrow(local_df), ncol = nrow(local_df))
row.names(A) = rownames(local_df)
colnames(A) = rownames(local_df)
for (i in 1:nrow(local_df)) {
barcode = rownames(local_df)[i]
array_row = local_df[i, 'y']
array_col = local_df[i, 'x']
# get neighbors
neighbours = getNeighbour(array_row, array_col)
# fill the adjacency matrix
for (this.vec in neighbours) {
tmp.p = rownames(local_df[local_df$y==this.vec[1] & local_df$x==this.vec[2], ])
if (length(tmp.p) >= 1) {
# target spots have neighbors in selected spots
for (neigh.barcode in tmp.p) {
A[barcode, neigh.barcode] = 1
}
}
}
}
A[1:5, 1:5] x1__AAACAAGTATCTCCCA x1__AAACAATCTACTAGCA x1__AAACAGAGCGACTCCT x1__AAACAGTGTTCCTGGG x1__AAACATTTCCCGGATT
x1__AAACAAGTATCTCCCA 0 0 0 0 0
x1__AAACAATCTACTAGCA 0 0 0 0 0
x1__AAACAGAGCGACTCCT 0 0 0 0 0
x1__AAACAGTGTTCCTGGG 0 0 0 0 0
x1__AAACATTTCCCGGATT 0 0 0 0 0
write.csv(A, 'IPF_spatial_spot_adjacency_matrix.csv')
print(sprintf('save Adjacency Matrix of spatial spots into file %s', 'IPF_spatial_spot_adjacency_matrix.csv'))[1] "save Adjacency Matrix of spatial spots into file IPF_spatial_spot_adjacency_matrix.csv"
Plot Adjacency Matrix. Each node is spot, spots within neighborhood are connected with edges.
Note reverse y axis to to make the origin (0,0) at top left
g = graph_from_adjacency_matrix(A, 'undirected', add.colnames = NA, add.rownames = NA)
# manually set nodes x and y coordinates
vertex_attr(g, name = 'x') = local_df$x
vertex_attr(g, name = 'y') = local_df$y
# reverse y axis to make the (0,0) at top left
plot(g, vertex.size=2, edge.width=4, margin=-0.05, ylim=c(1, -1))file_name = file.path(home.dir, 'IPF_scRNA_data.rds')
ref_data = readRDS(file_name)
print(sprintf('load data from %s', file_name))[1] "load data from /home/hill103/Documents/SharedFolder/ToHost/CVAE-GLRM_Analysis/RealData/IPF/IPF_scRNA_data.rds"
celltype_category_order = c("Myeloid", "Endothelial", "Stromal", "Lymphoid", "Epithelial", "Multiplet")
print(sprintf('cells: %d; genes: %d', ncol(ref_data), nrow(ref_data)))[1] "cells: 12070; genes: 60651"
print(sprintf('total %d distinct cell types', length(unique(ref_data$celltype))))[1] "total 44 distinct cell types"
cell count by cell types
meta.data = ref_data@meta.data
meta.data = meta.data %>% group_by(celltype_category, celltype) %>% summarise(count=n())`summarise()` has grouped output by 'celltype_category'. You can override using the `.groups` argument.
meta.data$celltype_category = factor(meta.data$celltype_category, levels = celltype_category_order)
ggplot(meta.data, aes(x=celltype, y=count, label=count)) +
geom_bar(position=position_dodge2(width=0.9, preserve="single"), stat="identity") +
geom_text(position=position_dodge2(width=0.9, preserve="single"), vjust=0.5, hjust=-0.1, angle=90) +
theme_bw() +
xlab("") +
ylab("# of Cells") +
theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1),
legend.position="bottom",
strip.placement = "outside",
strip.background = element_rect(fill=NA, colour="grey50"),
panel.spacing=unit(0,"cm")) +
facet_grid(~ celltype_category, space="free_x", scales="free_x", switch="x")We only focus on 26 major cell types
need_celltypes = c("Macrophage", "Macrophage_Alveolar", "cDC1", "cDC2", "cMonocyte", "ncMonocyte", "Mast",
"VE_Arterial", "VE_Venous", "VE_Capillary_A", "VE_Capillary_B", "Lymphatic",
"Fibroblast-Adventitial", "Fibroblast-Airway", "Fibroblast-Alveolar", "Pericyte-Alveolar", "SMC-Vascular",
"T", "B", "NK",
"Ciliated", "Basal", "Goblet", "ATI", "ATII", "AberrantBasaloid")
ref_data = subset(ref_data, subset = celltype %in% need_celltypes)
print(sprintf('remain cells: %d; genes: %d', ncol(ref_data), nrow(ref_data)))[1] "remain cells: 11227; genes: 60651"
cell count of those 26 major cell types
meta.data = ref_data@meta.data
meta.data = meta.data %>% group_by(celltype_category, celltype) %>% summarise(count=n())`summarise()` has grouped output by 'celltype_category'. You can override using the `.groups` argument.
meta.data$celltype_category = factor(meta.data$celltype_category, levels = celltype_category_order)
meta.data$celltype = factor(meta.data$celltype, levels = need_celltypes)
ggplot(meta.data, aes(x=celltype, y=count, label=count)) +
geom_bar(position=position_dodge2(width=0.9, preserve="single"), stat="identity") +
geom_text(position=position_dodge2(width=0.9, preserve="single"), vjust=0.5, hjust=-0.1, angle=90) +
theme_bw() +
xlab("") +
ylab("# of Cells") +
theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1),
legend.position="bottom",
strip.placement = "outside",
strip.background = element_rect(fill=NA, colour="grey50"),
panel.spacing=unit(0,"cm")) +
facet_grid(~ celltype_category, space="free_x", scales="free_x", switch="x") +
ylim(0, 5400)We excludes genes NOT expressed in any cells within the 26 major cell types
gene_sum = Matrix::rowSums(ref_data@assays$RNA@counts)
keep_genes = names(gene_sum)[gene_sum>0]
ref_data = subset(ref_data, features = keep_genes)
print(sprintf('remain cells: %d; genes: %d', ncol(ref_data), nrow(ref_data)))[1] "remain cells: 11227; genes: 35483"
Save cell type annotation of selected cells to file IPF_ref_scRNA_cell_celltype.csv.
write.csv(ref_data@meta.data[, 'celltype', drop=F], 'IPF_ref_scRNA_cell_celltype.csv')
print(sprintf('save cell type annotation of reference scRNA-seq cells into file %s', 'IPF_ref_scRNA_cell_celltype.csv'))[1] "save cell type annotation of reference scRNA-seq cells into file IPF_ref_scRNA_cell_celltype.csv"
Save scRNA-seq nUMI matrix to file IPF_ref_scRNA_cell_nUMI.csv.gz
ref_df = as.data.frame(as.matrix(Matrix::t(ref_data@assays$RNA@counts)), check.names=F)
ref_df[1:5, 1:5]
data.table::fwrite(ref_df, 'IPF_ref_scRNA_cell_nUMI.csv.gz', row.names = T)
Written 21.6% of 11227 rows in 2 secs using 4 threads. maxBuffUsed=2%. ETA 7 secs.
Written 31.3% of 11227 rows in 3 secs using 4 threads. maxBuffUsed=2%. ETA 6 secs.
Written 41.1% of 11227 rows in 4 secs using 4 threads. maxBuffUsed=2%. ETA 5 secs.
Written 50.4% of 11227 rows in 5 secs using 4 threads. maxBuffUsed=2%. ETA 4 secs.
Written 56.9% of 11227 rows in 6 secs using 4 threads. maxBuffUsed=2%. ETA 4 secs.
Written 63.5% of 11227 rows in 7 secs using 4 threads. maxBuffUsed=2%. ETA 4 secs.
Written 69.6% of 11227 rows in 8 secs using 4 threads. maxBuffUsed=2%. ETA 3 secs.
Written 75.8% of 11227 rows in 9 secs using 4 threads. maxBuffUsed=2%. ETA 2 secs.
Written 82.2% of 11227 rows in 10 secs using 4 threads. maxBuffUsed=2%. ETA 2 secs.
Written 89.1% of 11227 rows in 11 secs using 4 threads. maxBuffUsed=2%. ETA 1 secs.
Written 96.2% of 11227 rows in 12 secs using 4 threads. maxBuffUsed=2%. ETA 0 secs.
print(sprintf('save nUMI matrix of reference scRNA-seq cells into gzip compressed file %s', 'IPF_ref_scRNA_cell_nUMI.csv.gz'))[1] "save nUMI matrix of reference scRNA-seq cells into gzip compressed file IPF_ref_scRNA_cell_nUMI.csv.gz"